This report explores a dataset of 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. Using R, we identify which chemical properties influence the quality of white wines.
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
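As a minimal sketch of the loading step, the snippet below rebuilds the first five rows of a few columns from the `str()` output above, so it runs without the data file. The name `wine_head` and the file name in the comment are assumptions for illustration only; the report itself uses the data frame `wineQualityWhites`.

```r
# First five rows of selected columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this sketch.
wine_head <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23),
  residual.sugar   = c(20.7, 1.6, 6.9, 8.5, 8.5),
  quality          = c(6L, 6L, 6L, 6L, 6L)
)

# With the full file (file name assumed), the equivalent would be:
# wineQualityWhites <- read.csv("wineQualityWhites.csv")

dim(wine_head)   # rows and columns
str(wine_head)   # type and first values of every column

# X is only a row index, so it is dropped before analysis.
wine_head$X <- NULL

# Treat quality as a categorical rating with the levels 3 to 9.
wine_head$quality <- factor(wine_head$quality, levels = 3:9)
```

Dropping `X` and converting `quality` to a factor is consistent with the later summary, which lists quality as counts per rating rather than as quartiles.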
According to the bar plot, over 2,000 wines are rated “6”. Most of the wines are rated between “5” and “7”. Only 5 wines are rated “9”.
All of the distributions above are right skewed. The distribution of citric.acid has a second peak at around 0.4.
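The share of wines in the middle ratings can be checked from the per-rating counts that this report lists in its summary table. The sketch below hard-codes those counts; the variable names are illustrative.

```r
# Counts per quality rating, copied from the summary table in this report.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

total <- sum(quality_counts)

# Share of wines rated 5, 6 or 7.
share_5_to_7 <- sum(quality_counts[c("5", "6", "7")]) / total

total
round(100 * share_5_to_7, 1)
```

The counts sum to 4,898, and roughly 93% of the wines fall in the ratings 5 to 7.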
Some more histograms are plotted.
The outliers of residual sugar are shown below:
## [1] 23.50 31.60 31.60 65.80 26.05 26.05 22.60
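The report does not show how the outliers were extracted. One common approach, assumed here, is the 1.5 × IQR rule that boxplots use (in R, `boxplot.stats(x)$out` applies a closely related rule based on hinges). The sketch below applies the rule to the residual sugar quartiles from the summary table; `upper_fence` is a hypothetical helper.

```r
# Upper fence of the 1.5 * IQR rule: Q3 + k * (Q3 - Q1).
# `upper_fence` is an illustrative helper, not part of the original analysis.
upper_fence <- function(q1, q3, k = 1.5) {
  q3 + k * (q3 - q1)
}

# Residual sugar quartiles from the summary table: Q1 = 1.7, Q3 = 9.9.
fence_sugar <- upper_fence(1.7, 9.9)

# Outliers reported above.
reported <- c(23.50, 31.60, 31.60, 65.80, 26.05, 26.05, 22.60)

fence_sugar
all(reported > fence_sugar)
```

The fence is about 22.2, and every reported value lies above it. The same rule applied to free sulfur dioxide (quartiles 23 and 46) gives 80.5, in line with the threshold of 81 used for that variable.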
## Warning: Removed 5 rows containing non-finite values (stat_bin).
After removing the outliers, the distribution of residual sugar is shown above. A high peak appears at the left end and a lower peak appears in the middle.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
## [1] 0.009
The distribution of chlorides is right skewed: the mean (0.0458) lies above the median (0.043) and the maximum sits far out in the upper tail. Chlorides range from 0.009 to 0.346. The outliers are the values greater than or equal to 0.09.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## [1] 81
The distribution of free sulfur dioxide is right skewed, with a long upper tail. The maximum is 289 and the minimum is 2. The outliers are the values greater than or equal to 81.
## Warning: Removed 6 rows containing non-finite values (stat_bin).
Total sulfur dioxide is approximately normally distributed. The maximum is 440, the minimum is 9, and the mean is 138.4.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
Density is approximately normally distributed. The maximum is 1.039, the minimum is 0.987, and the mean is 0.994.
pH is approximately normally distributed. The maximum is 3.82, the minimum is 2.72, and the mean is 3.188.
The distribution of sulphates is right skewed. The maximum is 1.08, the minimum is 0.22, and the mean is 0.490.
The distribution of alcohol is right skewed. The maximum is 14.2, the minimum is 8, and the mean is 10.51.
The distribution of the fixed_volatile ratio is right skewed. The maximum is 90, the minimum is 5.545, and the mean is 27.657.
The free_total ratio is approximately normally distributed. The maximum is 0.71, the minimum is 0.024, and the mean is 0.256.
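A rough way to check the direction of skew without the plots is to compare mean and median: a mean above the median usually points to a longer right tail. This is only a heuristic, not a formal skewness measure. The sketch below uses the medians and means from the summary table in this report; the data frame name is illustrative.

```r
# Median and mean per variable, copied from the summary table in this report.
skew_check <- data.frame(
  variable = c("chlorides", "free.sulfur.dioxide", "sulphates",
               "alcohol", "fixed_volatile_ratio"),
  median   = c(0.04300, 34.00, 0.4700, 10.40, 26.071),
  mean     = c(0.04577, 35.31, 0.4898, 10.51, 27.657)
)

# Heuristic: mean > median suggests a right (positive) skew.
skew_check$mean_above_median <- skew_check$mean > skew_check$median

skew_check
```

For all five variables the mean exceeds the median, which is what a long upper tail would produce.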
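This dataset contains 4,898 observations of 12 variables (11 chemical properties plus quality), not counting the row index X. The summary below covers these 12 variables together with the two derived ratio variables: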
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality fixed_volatile_ratio free_total_ratio
## 3: 20 Min. : 5.545 Min. :0.02362
## 4: 163 1st Qu.:20.627 1st Qu.:0.19093
## 5:1457 Median :26.071 Median :0.25368
## 6:2198 Mean :27.657 Mean :0.25558
## 7: 880 3rd Qu.:33.000 3rd Qu.:0.31579
## 8: 175 Max. :90.000 Max. :0.71053
## 9: 5
According to the documentation, volatile acidity, citric acid, residual sugar, sulfur dioxide, sulphates and alcohol can influence the taste. Therefore, the later analysis focuses on those features.
Density and pH can help support the investigation because density is influenced by alcohol and sugar content, and pH is a result of acidity.
I created two new variables: the fixed/volatile acidity ratio and the free/total sulfur dioxide ratio. As acidity and sulfur dioxide are important features in determining the taste, these two ratios are worth exploring.
The distribution of residual sugar is unusual. There is a large peak at the beginning, followed by a lower peak in the middle.
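The two ratios can be sketched as follows, using the first three rows from the `str()` output near the top of this report. The column names `fixed_volatile_ratio` and `free_total_ratio` match the summary table; the data frame name `wine_rows` is an assumption for this example.

```r
# First three rows of the four input columns, copied from the str() output.
wine_rows <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.3, 0.28),
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Derive the two ratio variables row by row.
wine_rows <- transform(
  wine_rows,
  fixed_volatile_ratio = fixed.acidity / volatile.acidity,
  free_total_ratio     = free.sulfur.dioxide / total.sulfur.dioxide
)

round(wine_rows$fixed_volatile_ratio, 2)
round(wine_rows$free_total_ratio, 3)
```

Because free sulfur dioxide is a part of total sulfur dioxide, `free_total_ratio` should stay between 0 and 1, which gives a quick sanity check on the derived column.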
As quality is a categorical variable, boxplots are used to study the relationship between quality and each variable.
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_summary).
## Warning: Removed 3 rows containing missing values (geom_point).
As quality increases, fixed acidity initially decreases and then increases again.
As quality increases, volatile acidity initially decreases and then increases again.
As quality increases, the median of the fixed_volatile ratio increases.
## Warning: Removed 110 rows containing non-finite values (stat_boxplot).
## Warning: Removed 110 rows containing non-finite values (stat_summary).
## Warning: Removed 110 rows containing missing values (geom_point).
As quality increases, chlorides decrease.
## Warning: Removed 22 rows containing non-finite values (stat_boxplot).
## Warning: Removed 22 rows containing non-finite values (stat_summary).
## Warning: Removed 31 rows containing missing values (geom_point).
As quality increases, citric acid increases.
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
## Warning: Removed 3 rows containing non-finite values (stat_summary).
## Warning: Removed 3 rows containing missing values (geom_point).
The distributions of residual sugar are similar across the quality groups.
## Warning: Removed 17 rows containing non-finite values (stat_boxplot).
## Warning: Removed 17 rows containing non-finite values (stat_summary).
## Warning: Removed 17 rows containing missing values (geom_point).
Groups 5, 6, 7, 8 and 9 have similar distributions, which are higher than those of groups 3 and 4.
As quality increases, total sulfur dioxide decreases.
As quality increases, the free_total ratio increases.
## Warning in ggcorr(wineQualityWhites[, 1:14], label = TRUE, label_size =
## 3, : data in column(s) 'quality' are not numeric and were ignored
Most of the variables are not strongly correlated. The three largest correlations are 0.84, -0.82 and -0.78. I therefore plotted scatter plots of residual sugar against density and of alcohol against density to investigate further.
## Warning: Removed 3 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: wineQualityWhites$density and wineQualityWhites$residual.sugar
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8304732 0.8470698
## sample estimates:
## cor
## 0.8389665
Residual sugar and density are positively correlated and the correlation is 0.84.
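The test statistic and confidence interval in the output above can be reproduced from the correlation and the sample size alone. The sketch below uses the standard formulas for Pearson's correlation: t = r·sqrt(n − 2)/sqrt(1 − r²), and a confidence interval based on the Fisher z-transformation. `cor_summary` is a hypothetical helper written for this illustration.

```r
# Recompute t and the confidence interval of a Pearson correlation
# from r and n. `cor_summary` is an illustrative helper.
cor_summary <- function(r, n, level = 0.95) {
  df     <- n - 2
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)

  # Fisher z-transformation: atanh(r) is roughly normal with sd 1 / sqrt(n - 3).
  z          <- atanh(r)
  half_width <- qnorm(1 - (1 - level) / 2) / sqrt(n - 3)
  ci         <- tanh(c(z - half_width, z + half_width))

  list(t = t_stat, df = df, conf.int = ci)
}

# Values from the density vs. residual sugar output above.
res <- cor_summary(r = 0.8389665, n = 4898)
res
```

With n = 4,898 the interval is very narrow, so even a moderate correlation would be clearly distinguishable from zero; the size of the correlation matters more here than the p-value.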
## Warning: Removed 3 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: wineQualityWhites$density and wineQualityWhites$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Alcohol and density are negatively correlated and the correlation is -0.78.
The relationships between quality and the features of interest are presented as boxplots. The results are summarized as follows:
“+” indicates a positive relationship and “-” a negative relationship.
The relationship between pH and fixed acidity and the relationship between density and alcohol are plotted as well. The pH goes down when the fixed acidity increases. In addition, the density goes up when the alcohol decreases.
The strongest relationship appears between residual sugar and density. The correlation is 0.84.
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing non-finite values (stat_summary).
## Warning: Removed 18 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).
I initially plotted the relationship between chlorides and citric acid and found that the good-quality wines are located at the bottom center. Therefore, I think the quality may be affected by the chlorides/citric.acid ratio. I then plotted a boxplot of the chlorides/citric.acid ratio against quality. Higher-quality wines have a higher median chlorides/citric.acid ratio. Thus, wines with a higher chlorides/citric.acid ratio have a higher chance of being good wines.
The chlorides/citric.acid ratio reflects the chemical contents. On the other hand, properties of the wine itself, such as alcohol and pH, can also affect the taste. As with the chlorides/citric.acid ratio, a boxplot of the alcohol/pH ratio is also plotted. The quality improves as the alcohol/pH ratio increases.
Finally, I plotted a scatter plot of the chlorides/citric.acid ratio against the alcohol/pH ratio, with quality distinguished by color. The good wines are located in the top right corner. Therefore, the combination of a high alcohol/pH ratio and a high chlorides/citric.acid ratio indicates good wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Based on my intuition, the taste of white wine is largely affected by acidity, which is reflected by the pH value in this case. I wanted to know how the pH values are distributed. The distribution of pH is approximately normal. The middle 50% of the wines range from 3.09 to 3.28. Acidity that is either too low or too high results in a bad taste.
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).
As discussed previously, alcohol and density are correlated. I wanted to go further and see how the correlation changes as quality increases. From the plot, the correlation becomes stronger as the quality increases. However, the correlation for group “9” becomes weaker again. This may be caused by the small number of observations in this group.
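A per-group correlation can be computed by splitting the data on quality and applying `cor()` to each piece. The sketch below uses randomly generated toy data, not the wine data, purely to show the mechanics; the names `toy`, `cor_by_quality` and `n_by_quality`, and all the generated values, are assumptions for illustration.

```r
# Synthetic toy data (NOT the wine data): three quality groups of 30 rows each.
set.seed(1)
toy <- data.frame(
  quality = factor(rep(c(5, 6, 7), each = 30)),
  alcohol = runif(90, min = 8, max = 14.2)
)
# Density falls with alcohol, plus a little noise (made-up coefficients).
toy$density <- 1.01 - 0.0015 * toy$alcohol + rnorm(90, sd = 0.001)

# Pearson correlation of alcohol and density within each quality group.
cor_by_quality <- sapply(
  split(toy, toy$quality),
  function(g) cor(g$alcohol, g$density)
)

# Group sizes, to judge how much weight each correlation deserves.
n_by_quality <- table(toy$quality)

round(cor_by_quality, 2)
n_by_quality
```

Reporting the group sizes next to the correlations matters for the real data: with only 5 wines rated “9”, the correlation for that group is estimated from very few points and can swing widely.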
## Warning: Removed 23 rows containing missing values (geom_point).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.673 7.692 8.116 10.000 75.455
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.339 2.983 3.216 3.303 3.587 4.681
The scatter plot of the chlorides/citric.acid ratio against the alcohol/pH ratio is shown in Plot Three, with quality distinguished by color. The good wines are located in the top right corner. Therefore, the combination of a high alcohol/pH ratio and a high chlorides/citric.acid ratio indicates good wines.
I started by looking at the data and trying to find patterns. By examining different variables and their relationships using plots, I was able to gain a clear understanding of the factors affecting wine quality. This shows the importance of Exploratory Data Analysis (EDA) in data science, and I would count it as one of my successes.
As I am new to the wine industry, the terminology was a challenge. A better understanding of the data, particularly the variables (what they represent, what their units are, how they are generated, and how they relate to other variables), is worthwhile and can greatly speed up feature selection and analysis.
In this project, we identified that the chlorides/citric.acid ratio and the alcohol/pH ratio can affect the quality of wine. However, because the quality of wine is a discrete variable, I did not build a model to predict quality from the two ratios. In the future, I may use machine learning to solve this problem.